H-Packer: Holographic Rotationally Equivariant Convolutional Neural Network for Protein Side-Chain Packing

Accurately modeling protein 3D structure is essential for the design of functional proteins. An important sub-task of structure modeling is protein side-chain packing: predicting the conformation of side-chains (rotamers) given the protein’s backbone structure and amino-acid sequence. Conventional approaches for this task rely on expensive sampling procedures over hand-crafted energy functions and rotamer libraries. Recently, several deep learning methods have been developed to tackle the problem in a data-driven way, albeit with vastly different formulations (from image-to-image translation to directly predicting atomic coordinates). Here, we frame the problem as a joint regression over the side-chains’ true degrees of freedom: the dihedral χ angles. We carefully study possible objective functions for this task, while accounting for the underlying symmetries of the task. We propose Holographic Packer (H-Packer), a novel two-stage algorithm for side-chain packing built on top of two light-weight rotationally equivariant neural networks. We evaluate our method on CASP13 and CASP14 targets. H-Packer is computationally efficient and shows favorable performance against conventional physics-based algorithms and is competitive against alternative deep learning solutions.


Introduction
Proteins are macromolecules composed of residues (amino acids) that are linked consecutively to form an amino-acid sequence. Each residue is conceptually divided into two parts: (i) a backbone structure common to all amino acids, which comprises the alpha carbon (C-α) bonded to an amino group (-NH2) and a carboxyl group (-COOH); and (ii) a residue-specific side-chain. Backbones are connected by peptide bonds between the amino and carboxyl groups of consecutive residues. Physical interactions between the freely-moving side-chains cause the protein chain to fold into a complex 3D structure, which gives the protein its function.
Conceptually, a protein's full atomic structure can be divided into its backbone structure (the coordinates of its backbone atoms) and its side-chain conformations (the coordinates of its side-chain atoms). Side-chain conformations are relatively flexible, while the backbone structure is more rigid and gives the protein its main 3D topology, and thus, its main function. Nonetheless, the interaction between a protein's backbone and side-chains is essential for the stability of the fold and protein function.
Determining amino acid side-chain conformations in a protein, known as Protein Side-Chain Packing (or Rotamer Packing), is an essential step in protein folding and the de novo design of proteins. Computational approaches to protein folding often divide the structure inference problem into two steps: first, they characterize the rigid backbone structure, and then they pack the side-chains associated with the amino acids at each residue. The flexibility of the side-chains makes the search in the space of possible conformations inevitably complex and computationally expensive. De novo protein design protocols rely on similar logical steps: often, an amino acid sequence compatible with a desirable backbone structure is inferred (designed) [1], and then the associated side-chains are packed to form the full atomic composition of a protein.
Many of the conventional methods for side-chain packing rely on physical models through which they find a rotamer that minimizes a physically-reasoned heuristic energy of the protein fold [2,3,4]. However, these computational methods often lack accuracy and speed in their predictions. As deep learning makes strides in protein science, there is a growing effort in developing machine learning methods for rotamer packing. Among these methods is DLPacker [5], which treats the packing problem as an image transformation. This algorithm characterizes the local environment of a given amino acid backbone within a structure as a 3D image, and uses this model to predict the atomic coordinates of the side-chain. It then compares the predicted side-chain to a pre-set library of rotamers to select the closest conformation. AttnPacker [6], a more recently developed method, uses a deep graph attention network to model the local geometry of a residue within a structure and is trained to predict the coordinates of the side-chain atoms. Recently, diffusion models over side-chain torsional angles, such as DiffPack [7], have also been applied.
Here, we tackle the problem of side-chain packing by learning to directly regress over χ (torsional) angles, which are the main degrees of freedom determining side-chain conformations. We derive and discuss three possible parameterizations of the χ angles, ultimately settling on regressing over the sine and cosine transforms of the angles. We introduce Holographic Packer (H-Packer), a deep learning method that packs rotamers by first predicting candidate χ angles from backbone and sequence, and then refining the predictions with a model trained on full-atom structures. Our approach relies on our previously developed holographic convolutional neural network (H-CNN) to characterize amino acid preferences, given their local atomic environment within a structure [8]. H-CNN, and by extension H-Packer, are locally rotationally (i.e., SO(3)) equivariant, in that they can physically reason about the local geometry of protein structures. Specifically, they achieve their rotational equivariance by operating fully in the spherical Fourier space.
By directly predicting the side-chain χ angles, H-Packer does not rely on comparing its output with a pre-set library of rotamers, making it computationally more efficient than methods like DLPacker. Furthermore, H-Packer is light-weight (2×3M parameters vs. 208M for AttnPacker) and requires few resources to train (a single GPU vs. multiple GPUs for diffusion models like DiffPack). We evaluate the packing performance of H-Packer on standard datasets, and show that it generally performs better than conventional physics-based methods and is competitive with machine learning solutions. In general, our results suggest that H-Packer has learned complementary features to alternative methods. Our code is freely available at https://github.com/gvisani/hpacker.

Methods
In this work, we study the problem of amino-acid side-chain packing using rotationally equivariant neural networks. We introduce H-Packer, a novel yet simple algorithm that predicts side-chain conformations by jointly predicting the values of the key degrees of freedom of a side-chain, i.e., its χ angles.

Modeling side-chain conformations with χ Angles
While amino-acid side-chains are composed of at most 10 heavy atoms (in the case of Tryptophan), their 3D conformations can be uniquely described by the values of at most 4 dihedral angles, referred to as the χ angles (Figure 1A). This reduction in the number of degrees of freedom is possible because of the physical constraints on the remaining internal coordinates (bond angles, bond lengths, and dihedral angles), which we refer to as redundant internal coordinates. Specifically, the inter-atomic physical interactions within amino acids often constrain these redundant coordinates to a constant, or to a well-defined function of the residue's χ angles. Therefore, predicting χ angles is the key step in side-chain packing.
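As a concrete illustration of what a χ angle is, the following minimal sketch computes the signed dihedral angle defined by four consecutive atoms (for example N, C-α, C-β, C-γ for χ1 of many residues). It uses only the Python standard library; the function name `dihedral_deg` and the helper names are illustrative and are not taken from the H-Packer code base.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    The angle is measured between the plane through (p0, p1, p2)
    and the plane through (p1, p2, p3).
    """
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the first plane
    n2 = _cross(b1, b2)  # normal of the second plane
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the fourth atom is rotated by a quarter turn
# about the central bond relative to the first atom.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Because the angle depends only on relative atom positions, translating or rotating all four points together leaves it unchanged.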
H-Packer addresses the side-chain packing problem in two steps: (i) it predicts the values of the χ angles, and (ii) it reconstructs the atomic coordinates using the predicted χ angles and the constrained values of the redundant internal coordinates. Specifically, we evaluate the redundant internal coordinates from a subset of training data (1,700 structures) by leveraging the internal_coords module of the Biopython package. We empirically verified that the distributions of values were Gaussian with low variance, and chose to take their medians as the ground truth. Substituting these values for the original ones yields a negligible Null Reconstruction error of approximately 0.127 Å (Figure A.1). Notably, this error remains unchanged even when using only 100 reference structures instead of 1,700 (Figure A.2).
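Step (ii) amounts to converting internal coordinates back to Cartesian coordinates. The sketch below shows one standard way to place a single atom from three already-placed atoms, a fixed bond length, a fixed bond angle, and a dihedral (χ) angle. It is a generic construction written for illustration, not the paper's implementation (which relies on Biopython); the function name `place_atom` and the numeric values in the usage line are assumptions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Return d such that |c - d| = bond_length, the angle (b, c, d) equals
    bond_angle_deg, and the dihedral (a, b, c, d) equals dihedral_deg."""
    theta = math.radians(bond_angle_deg)
    phi = math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))                    # axis along the b -> c bond
    n = _unit(_cross(_sub(b, a), bc))         # normal of the (a, b, c) plane
    m = _cross(n, bc)                         # completes the local frame
    # Coordinates of d in the local (bc, m, n) frame centered at c.
    local = (-bond_length * math.cos(theta),
             bond_length * math.sin(theta) * math.cos(phi),
             bond_length * math.sin(theta) * math.sin(phi))
    return tuple(c[k] + local[0] * bc[k] + local[1] * m[k] + local[2] * n[k]
                 for k in range(3))

# Illustrative values only (not the medians measured in the paper).
print(place_atom((0.3, 1.2, -0.4), (1.0, 0.0, 0.2), (1.5, 1.1, 0.9), 1.53, 111.0, -65.0))
```

In this picture, the bond length and bond angle play the role of the redundant internal coordinates held at fixed values, while the dihedral is the predicted χ angle. Applying the function atom by atom along the side-chain rebuilds the full conformation.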

Predicting χ angles using H-Packer
We aim to predict the χ angles associated with a side-chain conformation from the configuration of atoms surrounding a given residue. This atomic neighborhood comprises the backbone and the side-chain atoms of the neighboring residues in the structure.
[Figure 1, caption fragment: ... shows the H-CNN-style network for side-chain packing, which first predicts the missing residue's χ angles from its surrounding atomic environment and then uses the χ angles to reconstruct the residue's side-chain. As illustrated in C, H-Packer consists of two H-CNN networks: one trained on backbone atoms only and used to make an initial guess, and one trained on full side-chain neighborhoods and used to refine the predictions.]
During inference, only the coordinates of the backbone atoms are known a priori, alongside the identity of the amino acids they belong to. However, physical interactions with the atoms of other side-chains are the true determinants of a residue's conformation. Therefore, we develop H-Packer into a two-step solution (Figure 1C). Specifically, we train two models: one to predict χ angles from the backbone atoms and amino-acid identity alone, and another to predict χ angles from full neighborhoods, i.e., by including the true side-chain atoms of the surrounding residues (minus the residue of interest). At inference time, we use the first model to make an initial guess of the side-chain conformations, and then the second model to iteratively refine the predictions.
To build the individual models that predict χ angles, we start by considering their symmetries. Notably, χ angles are invariant to rigid-body transformations (translations and rotations) of the protein (i.e., they are SE(3) invariant). Translation invariance can be satisfied by choosing a well-defined center for a residue of interest; we choose the residue's C-α, as it is a common component of all residues and is at the beginning of the side-chain. We then still need to account for rotational invariance about the specified center, which is associated with transformations under the rotation group SO(3).
To respect this rotational symmetry, we build SO(3)-equivariant models to predict a residue's χ angles from its surrounding atomic environment. Equivariance is a generalization of invariance: when a function's input is transformed by the action of a certain group element (in this case, of the rotation group SO(3)), the output is transformed by the same group element in a well-defined way. Equivariant layers ensure both expressivity and efficiency when fitting both invariant and equivariant functions (see Appendix A.1 for details). To develop these models, we use an approach inspired by our previous work [9,8]. We consider as input the point cloud of atoms within a radius r = 10 Å of the residue's C-α (with or without the neighboring side-chains). To ensure rotational equivariance, we both encode the input in a rotationally equivariant fashion (i.e., a holographic encoding), and use SO(3)-equivariant layers to predict the χ angles.
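The two-stage inference procedure can be summarized by the following sketch. All names (`pack_side_chains`, `initial_model`, `refinement_model`, `rebuild`) are hypothetical placeholders for the two trained networks and the coordinate-reconstruction step; they do not correspond to functions in the released code.

```python
def pack_side_chains(backbone, initial_model, refinement_model, rebuild,
                     n_refinements=2):
    """Two-stage packing loop: initial guess, then iterative refinement.

    backbone         : backbone atoms plus amino-acid identities
    initial_model    : callable mapping the backbone to chi angles
    refinement_model : callable mapping a full-atom structure to chi angles
    rebuild          : callable placing side-chain atoms from chi angles
    """
    # Stage 1: initial guess from backbone atoms and residue identities only.
    chis = initial_model(backbone)
    # Stage 2: rebuild all side-chains, then re-predict every residue's chi
    # angles from its full-atom neighborhood; repeat a fixed number of times.
    for _ in range(n_refinements):
        full_atom = rebuild(backbone, chis)
        chis = refinement_model(full_atom)
    return rebuild(backbone, chis), chis
```

Setting `n_refinements=0` returns the structure built from the initial guess alone, which is convenient for comparing the unrefined and refined predictions.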

Holographic encoding of the data
We represent the point cloud of atoms within a structural neighborhood with a density function, obtained by summing over (weighted) Dirac-δ functions that indicate the presence of atoms at given positions in space:

$$\rho(r, \theta, \phi) = \sum_{i \in \text{points}} \omega_i \, \delta(\mathbf{r}_i - \mathbf{r}),$$

where $\omega_i$ is the weight associated with point $i$ at position $\mathbf{r}_i$. We then use the 3D Zernike Fourier Transform (ZFT) of the density function to encode the neighborhood into a convenient SO(3)-equivariant basis,

$$\hat{Z}^{n}_{\ell m} = \int \rho(r, \theta, \phi)\, Y_{\ell m}(\theta, \phi)\, R^{n}_{\ell}(r)\, dV, \qquad (1)$$

where $Y_{\ell m}(\theta, \phi)$ is the spherical harmonic of degree ℓ and order m, and $R^{n}_{\ell}(r)$ is the radial Zernike polynomial in 3D with radial frequency n ≥ 0 and degree ℓ. $R^{n}_{\ell}(r)$ is non-zero only for even values of n − ℓ ≥ 0. Notably, the spherical harmonics that describe the angular component of the ZFT arise from the irreducible representations of the 3D rotation group SO(3), and form a convenient basis under rotation in 3D (see Appendix A.1). Zernike projections in spherical Fourier space can be understood as a superposition of spherical holograms of an input point cloud, and thus, we term this operation the holographic encoding of the data [9,8].
We truncate the Fourier expansion at a maximum degree ℓ_max and a maximum radial frequency n_max. Additionally, we normalize the Fourier coefficients of each Dirac-δ function by the sum of the squares of its coefficients. We found this normalization to be beneficial for training, likely due to the avoidance of singularities close to the boundaries.
Following [8] and [9], we incorporate atom-level input features by dividing the holographic encoding into different channels (see Figure 1). We consider the following two sets: (i) Atomic channels: C, N, O, S, a wildcard element channel (excluding hydrogens), and partial charge from the Amber99sb force field [10]; and (ii) Amino-acid channels: one for each of the 20 canonical amino acids, plus a wildcard channel. We include the charge value in its dedicated channel as the weights ω_i coupled to the point cloud's density function. While we train the initial guess model using both sets of channels (atomic and amino-acid) as input, we only consider the atomic channels for the refinement model. We do this in an effort to make the model's predictions more grounded in physical interactions. We condition both models on the identity of the residue of interest by concatenating a linear embedding of its one-hot encoding to the input's invariant (ℓ = 0) features. This is particularly necessary for the refinement model, which is trained only with atomic channels, since it would not otherwise know the identity of the residue of interest.

SO(3)-Equivariant neural network architecture
We use the resulting holograms as inputs to an SO(3)-Equivariant Convolutional Neural Network (Figure 1B). The key is to transform the inputs through the network such that all intermediate outputs of the network remain rotationally equivariant. Our resulting model is conceptually divided into three parts. First, a linear layer projects the data and conditioning to a hidden representation with the same number of features per ℓ. Second, a stack of equivariant blocks is connected via additive skip connections, each block being composed of: (i) a feature-wise tensor product nonlinearity, (ii) layer normalization with a SiLU nonlinearity, and (iii) a linear layer whose output dimensions are the same as the input's. After the final block, we retain only the features of type ℓ = 0 or ℓ = 1, depending on the training objective (Section 2.2.3). It should be noted that features of type ℓ = 0 are rotationally invariant scalars, whereas those associated with ℓ = 1 are equivariant vectors that transform consistently with the input under rotation. We use ℓ = 1 features to directly learn the orientation of the intersecting planes that define a side-chain's dihedral angles χ (see Section 2.2.3). Third, optionally and only for the models with invariant (ℓ = 0) output, we apply a standard feed-forward neural network with dropout regularization and SiLU nonlinearity. We refer to Section A.2 in the appendix for more details on the architecture components.

Training objectives to infer χ angles
We consider three alternative parameterizations of χ angles, i.e., three possible objective functions:
(i) The angle itself. χ angles are defined between −180° and 180°, with a periodicity such that the angles −179° and 179° are to be considered 2° apart, not 358°. Thus, a plain MSE loss would pose strong and unnatural constraints on the model. To account for this, we wrap the predictions (modulo 360°) to fall in the valid range, and compute the error between two angles as the minimum between the computed error and 360° minus the error, resulting in the following periodic variant of the MSE loss:

$$\mathcal{L}_{\text{angle}} = \frac{1}{N_\chi} \sum_{i=1}^{N_\chi} \min\left( |\hat{\chi}_i - \chi_i|,\; 360^\circ - |\hat{\chi}_i - \chi_i| \right)^2, \qquad (2)$$

where $\hat{\chi}_i$ and $\chi_i$ are the predicted and the true values of the $i$-th χ angle, respectively, and $N_\chi$ is the number of χ angles associated with the residue of interest. In our implementation, the χ angle domain is scaled and shifted to fall in [0, 2] to make the scale of the loss functions comparable between the three representations of the angles.
(ii) Sine and cosine transforms of the angle. A pair of sine and cosine transforms provides an alternative representation for a χ angle that accounts for its periodicity and is also rotationally invariant; a similar approach is also considered in concurrent work [11]. We directly predict sine and cosine values by feeding 8 outputs from the network (a sine and a cosine for each of up to four χ angles) to a tanh activation function; these then form the arguments of an MSE loss function:

$$\mathcal{L}_{\text{sin-cos}} = \frac{1}{2 N_\chi} \sum_{i=1}^{N_\chi} \left[ \left( \widehat{\sin\chi_i} - \sin\chi_i \right)^2 + \left( \widehat{\cos\chi_i} - \cos\chi_i \right)^2 \right]. \qquad (3)$$

Notably, this loss function has a simple geometric interpretation: it is equivalent to computing the cosine loss between the 2D vectors that describe the χ angles on the unit circle (proof in Eq. A.6).
(iii) Normal vectors to the dihedral planes. χ angles are examples of dihedral angles, meaning that they are defined as the angle between two planes. For χ angles, the two planes are described by subsequent triplets of atoms along the side-chain. Any two subsequent χ angles share one plane. Therefore, any conformation with N_χ angles can be alternatively described by N_χ + 1 planes (or their normal vectors); one of these normal vectors is a redundant internal coordinate (defined by backbone + Cβ atoms), while the others specify the N_χ independent degrees of freedom.
We consider training models to predict the dihedral planes' normal vectors, $\mathbf{n}_{\chi_1}, \ldots, \mathbf{n}_{\chi_4}$. It should be noted that, unlike the sine/cosine transforms, these vectors are not invariant to rotations, but equivariant of type ℓ = 1 (geometric vectors), and can be extracted from the H-Packer equivariant network. We use a cosine loss over the true and predicted vectors:

$$\mathcal{L}_{\text{normals}} = \frac{1}{N_\chi} \sum_{i=1}^{N_\chi} \left( 1 - \frac{\hat{\mathbf{n}}_{\chi_i} \cdot \mathbf{n}_{\chi_i}}{\lVert \hat{\mathbf{n}}_{\chi_i} \rVert \, \lVert \mathbf{n}_{\chi_i} \rVert} \right). \qquad (4)$$

Relevant symmetries in computing loss functions. Some amino acid conformations exhibit a rotation symmetry by π in some of their χ angles. For example, χ_2 of Phenylalanine and Tyrosine indicates the torsion of their benzene rings; thus, a rotation by π leaves the conformation physically unchanged. However, as χ angles are formally defined by internal atom names, these equivalent conformations are associated with different χ angle values. We correct for this degeneracy by taking the minimum loss value obtained when considering χ and χ + π as targets, during both training and evaluation. When computing the error on the atomic coordinates (generally via Root Mean Square Deviation, RMSD) for the full side-chain, we need to consider other such symmetries between non-χ atoms, as listed in Table A.1.
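The angle-based objectives and the π-symmetry correction described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the training code: angles are in degrees instead of the rescaled [0, 2] domain, squaring the wrapped error and the 1/(2N) normalization of the sine/cosine loss are assumptions, and all function names are hypothetical.

```python
import math

def wrapped_abs_diff(a_deg, b_deg):
    """Absolute angular difference that respects the 360-degree periodicity."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def angle_loss(pred_deg, true_deg):
    """Mean squared wrapped error (the squaring is an assumption)."""
    errs = [wrapped_abs_diff(p, t) ** 2 for p, t in zip(pred_deg, true_deg)]
    return sum(errs) / len(errs)

def sincos_loss(pred_sincos, true_deg):
    """MSE between predicted (sin, cos) pairs and those of the true angles."""
    total = 0.0
    for (s, c), t in zip(pred_sincos, true_deg):
        t_rad = math.radians(t)
        total += (s - math.sin(t_rad)) ** 2 + (c - math.cos(t_rad)) ** 2
    return total / (2 * len(true_deg))

def cosine_loss_2d(pred_deg, true_deg):
    """Cosine loss between unit vectors on the circle: 1 - cos(difference)."""
    errs = [1.0 - math.cos(math.radians(p - t)) for p, t in zip(pred_deg, true_deg)]
    return sum(errs) / len(errs)

def pi_symmetric_error(pred_deg, true_deg):
    """Error for chi angles where chi and chi + 180 degrees are equivalent."""
    return min(wrapped_abs_diff(pred_deg, true_deg),
               wrapped_abs_diff(pred_deg, true_deg + 180.0))

# When the predicted (sin, cos) pair has unit norm, the two losses coincide.
pred, true = [30.0, -100.0], [50.0, 170.0]
pairs = [(math.sin(math.radians(p)), math.cos(math.radians(p))) for p in pred]
print(sincos_loss(pairs, true), cosine_loss_2d(pred, true))
```

The agreement between the last two quantities follows from the identity (sin a − sin b)² + (cos a − cos b)² = 2 − 2 cos(a − b). It holds exactly only for unit-norm predictions, which tanh outputs do not guarantee.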

Related Work
Protein side-chain packing. Methods for side-chain packing can be divided into (older) physics-based algorithms [2,4,3,12,13] and (newer) machine learning (ML) approaches [5,6,7,11,14,15]. Physics-based approaches generally work by minimizing a hand-crafted energy function over the side-chain conformational space, usually with the help of a rotamer (i.e., side-chain conformation) library to discretize and reduce the dimensionality of such space. Popular algorithms include RosettaPacker from the Rosetta suite [2], FASPR [4], and SCWRL [3]. Among ML methods, the most related to this work include: DLPacker [5], which frames the problem as an image-to-image translation (with "channels" analogous to ours) to predict a 3D "image" of the desired rotamer, which is then matched against a rotamer library to return a valid representation; AttnPacker [6], which uses a large (208M-parameter) model derived from the SE(3)-Transformer [16] to directly predict the coordinates of side-chain atoms from the backbone structure and the amino-acid sequence; the concurrent ZymePackNet [11] (open source code not available), which autoregressively predicts the sine and cosine of χ angles, using two graph neural networks in a two-step procedure similar to ours; and DiffPack [7], which consists of four expensive diffusion models, one over each of the χ angles, used together autoregressively at inference time.
Equivariant neural networks for protein structures. In recent years, great success has been achieved in structural biology by leveraging the underlying geometric symmetries of protein structures and surfaces, through neural networks that are equivariant to the relevant symmetry transformations [17,18,19,8,9]. Specifically, a great deal of literature has been devoted to efficiently modeling 3D atomistic systems using neural networks equivariant to Euclidean symmetries [16,20,21,22,23]. The drawback is that most such methods are computationally expensive because they compute tensor products between all pairs of neighboring atoms (see Section A.2 and [20,21]). Here, we greatly reduce computational complexity by constructing equivariant representations of a system about a single natural center (the central residue's C-α), following an approach originally designed to model spherical images [23]. Applying this approach to residue-level structure modeling has proven effective in predicting amino-acid propensities in protein structures [8], as well as in compactly encoding residue environments in an unsupervised way for downstream tasks [9].

Toy task: inferring χ angles from atomic coordinates
We start by studying the behavior of our model on a simple task: predicting (or rather, calculating) χ angles from the true atomic coordinates of the conformation. We found this to be a useful benchmark to study our model's behavior.
Setup. We randomly select 160 structures from our real task's training set (see below) and split them into 100/30/30 for training/validation/testing, respectively. We collect conformations of all residues presenting χ angles, and consider only their heavy atoms (C, N, O, S). We then apply the Zernike encoding, varying ℓ_max from 1 to 5, and train models with ℓ_max consistent with that of the input, as well as with different prediction objectives (angles, sin-cos of angles, plane norms). Crucially, we vary the number of hidden channels (decreasing it with higher ℓ_max) to keep the number of parameters constant at around 330k, thus removing differences in model capacity as a contributing factor to performance. We do not condition the models on amino-acid identity, to make the problem more challenging and therefore more interesting. We refer to Section A.4 for more details.
Results. Test Mean Absolute Error (MAE) per χ angle for all models is shown in Figure 2, and training curves are shown in the Appendix (Figure A.3). Notably, the Angle model performs the worst, and is unable to recover the true χ angles with negligible error. The Sin-Cos and Plane Norm models instead recover all χ angles with very low error (< 5°) with ℓ_max > 1. It appears that ℓ_max = 2 is the minimum degree needed to solve this task with high accuracy. We note that the error is higher for later χ angles. We hypothesise that this is expected for two reasons: (i) later χ angles depend on atoms that are farther away from the center of the neighborhood, thus having lower angular resolution within the Zernike representation, and (ii) there is simply less training data for them. Weighting χ angles in the loss function according to their average frequency partially mitigates the second issue (Figure A.4). Notably, the fact that the model performs well without explicit knowledge of amino acid identities implies that it can easily infer the amino acid type from the number and the relative location of the atoms.

Side-Chain Packing
Dataset. We consider the training and validation datasets used in DLPacker [5], consisting of 19,436 structures with a maximum inter-protein sequence similarity of 50%. Unlike DLPacker, we do not remodel structures with PDB-REDO [24] and do not convert selenomethionine residues into methionine. For testing our model, we use the CASP13 and CASP14 targets (82 and 64 structures, respectively). We remove from the training and validation sets any protein that has sequence similarity above 50% with any of the proteins in the test set.
H-Packer training. We used the Sin/Cos loss function (Eq. 3) as it was the best-performing loss in our toy task; while the Plane Norms loss (Eq. 4) also performed well in the toy task, we found that models trained with the Sin/Cos objective were easier to regularize via dropout in the final invariant feed-forward neural network. The initial guess and the refinement networks were trained with the same ℓ_max = 5 and n_max = 12; the latter was chosen such that it included at least one radial function with wavelength lower than the minimum interatomic distance. We also considered models trained with ℓ_max = 4, tuning the number of hidden features to keep the number of trainable parameters the same as in the ℓ_max = 5 models, i.e., ∼3M. All models were trained for 10 epochs, keeping the model with the lowest validation loss at the end of an epoch; see further details in Section A.4. Throughout our experiments, we consider the performance of H-Packer models with different numbers of rounds of refinement. For example, H-Packer_0 denotes the model with no refinement. For each model, we also compute an upper bound on the performance of the refinement process by tasking the refinement model to predict χ angles from the ground truth neighboring structures (i.e., the toy task); we denote this by H-Packer_up.
Metrics. In line with previous work [6,7], we evaluate our models on three main metrics: (i) angle-specific Mean Absolute Error (MAE); (ii) residue-level angle accuracy, defined as the proportion of residues for which the predictions of all χ angles are within 20° of the true values; and (iii) average atomic Root Mean Square Deviation (RMSD) of side-chain atoms across residues. We further distinguish between Surface and Core residues, as conformations occurring on the surface of proteins are notoriously harder to predict. Surface residues are defined as having at most 15 C-β atoms within 10 Å of their own C-β, whereas core residues must have at least 20 C-β atoms in this range.
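The angle-based metrics and the surface/core split can be sketched as follows. The function names, the use of an inclusive 20° threshold, and the label "intermediate" for residues with 16 to 19 neighboring C-β atoms are assumptions made for illustration; χ angles are given in degrees, as one list per residue.

```python
def wrapped_abs_diff(a_deg, b_deg):
    """Absolute angular difference that respects the 360-degree periodicity."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def chi_mae(pred, true, chi_index):
    """MAE of one chi angle over all residues that possess that angle."""
    errs = [wrapped_abs_diff(p[chi_index], t[chi_index])
            for p, t in zip(pred, true) if len(t) > chi_index]
    return sum(errs) / len(errs)

def residue_accuracy(pred, true, tol_deg=20.0):
    """Fraction of residues whose chi angles are all within tol_deg."""
    hits = sum(
        all(wrapped_abs_diff(p, t) <= tol_deg for p, t in zip(p_res, t_res))
        for p_res, t_res in zip(pred, true)
    )
    return hits / len(true)

def burial_class(n_cbeta_within_10A):
    """Surface: at most 15 neighboring C-beta atoms; core: at least 20."""
    if n_cbeta_within_10A <= 15:
        return "surface"
    if n_cbeta_within_10A >= 20:
        return "core"
    return "intermediate"

# Toy predictions for three residues with 2, 1, and 2 chi angles.
pred = [[10.0, 175.0], [60.0], [-60.0, 30.0]]
true = [[25.0, -175.0], [95.0], [-65.0, 45.0]]
print(chi_mae(pred, true, 0), chi_mae(pred, true, 1), residue_accuracy(pred, true))
```

Note that the second residue fails the accuracy criterion on its only χ angle, while the 175° versus −175° pair in the first residue counts as a 10° error because of periodicity.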

[Table 1: results on CASP13, including angle MAE; table contents not recovered.]
Interestingly, while H-Packer predictions are improved by the refinement network, the performance saturates after 2 steps of refinement; the accuracies after 5 iterations of refinement are comparable to those after only 2 steps (Table 1). Therefore, it is unlikely that further refinement could improve H-Packer's performance enough to reach its upper bound. We hypothesise that training H-Packer to produce confidence scores might help in developing site-specific convergence criteria to help bridge the gap [6,7].
Ablation in ℓ_max. Table 3 shows how changing ℓ_max (from 4 to 5) impacts the performance of H-Packer. For the same model capacity, using higher ℓ_max consistently yields better performance, indicating that higher angular resolutions of the input can be beneficial for learning this task. This performance improvement comes with a trade-off in training and inference time, which scale superlinearly with ℓ_max unless the Tensor Product computation is adequately constrained [25]. For reference, training our models with ℓ_max = 5 takes ∼40% longer than those with ℓ_max = 4. We leave the hyperparameter optimization of ℓ_max to future work.
[Table 4, caption fragment: ... takes 1,482 s to reconstruct the 82 CASP13 targets on a single NVIDIA A40 GPU. Times for the other methods were taken from [6].]
On computing RMSD fairly. In Table 1 we report RMSD computed by measuring the distance between the coordinates of true and predicted atoms, modulo the symmetries we report in Table A.1. However, other algorithms such as AttnPacker [6] consider other symmetries as well, sometimes even between atoms of differing chemical elements. Though these symmetries reflect spatially similar conformations (such as a flip of the Histidine ring), they result in inflated (i.e., overly favorable) RMSD scores. We show the effect of this inflation on H-Packer predictions in Table 2. In the same table, we also show the RMSD computed against true structures that have been "reconstructed" using the true χ angles and the constant values that we use for the redundant internal coordinates within H-Packer; we do this in an effort to disentangle the Null Reconstruction Error (Figure A.1) from the error caused by mistakes in χ angle prediction.
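Computing RMSD "modulo symmetries" can be sketched as taking the minimum RMSD over the allowed relabelings of equivalent atoms. The example below handles a single set of simultaneous swaps. The swap pairs used for Phenylalanine (CD1/CD2 and CE1/CE2, i.e., a ring flip) are given for illustration and are not a reproduction of Table A.1; the function names are hypothetical.

```python
import math

def rmsd(pred, true):
    """RMSD between two dicts mapping atom names to (x, y, z) coordinates."""
    names = sorted(true)
    sq = sum((pred[n][k] - true[n][k]) ** 2 for n in names for k in range(3))
    return math.sqrt(sq / len(names))

def swap_atoms(coords, pairs):
    """Return a copy of coords with the given pairs of atom names exchanged."""
    swapped = dict(coords)
    for a, b in pairs:
        swapped[a], swapped[b] = coords[b], coords[a]
    return swapped

def symmetric_rmsd(pred, true, swap_pairs):
    """Minimum RMSD over the original and the symmetry-swapped labeling."""
    return min(rmsd(pred, true), rmsd(swap_atoms(pred, swap_pairs), true))

# Illustrative ring-flip equivalence for Phenylalanine.
PHE_SWAPS = [("CD1", "CD2"), ("CE1", "CE2")]
true = {"CD1": (1, 0, 0), "CD2": (-1, 0, 0), "CE1": (2, 0, 0), "CE2": (-2, 0, 0)}
pred = swap_atoms(true, PHE_SWAPS)  # same geometry, flipped atom names
print(rmsd(pred, true), symmetric_rmsd(pred, true, PHE_SWAPS))
```

Every additional swap that is allowed can only lower the reported value, which is why the set of symmetries considered matters when comparing methods.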
Speed. Table 4 shows relative reconstruction speeds for several packing algorithms. Using the current implementation of the reconstruction algorithm, the best-performing H-Packer model is about 7x faster than the popular algorithm RosettaPacker and 6x faster than DLPacker; however, it is considerably slower than AttnPacker. Runtime can be roughly halved, at the expense of minor performance degradation, by using two refinement iterations instead of five. However, more considerable speed gains may be achieved by CPU parallelization when computing the holographic encodings of structural neighborhoods during initial data processing. Indeed, each initial guessing and refinement step of H-Packer predicts all χ angles at once, but in the current implementation holographic encodings are computed in series, creating a bottleneck that currently accounts for 88% of the inference time (10% is atom placement, and only 2% is making the actual predictions on GPU). We plan on optimizing this aspect in future iterations of the model.
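A rough sense of the attainable gain can be obtained from Amdahl's law, using the 88% share of inference time reported above for the holographic encodings. The sketch below assumes that this step parallelizes perfectly across CPU workers with no overhead, which is an idealization; the function name is hypothetical, and the resulting numbers are estimates rather than measurements.

```python
def estimated_speedup(parallel_fraction, n_workers):
    """Amdahl's law: overall speedup when only a fraction of the runtime
    is spread evenly across n_workers and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# 88% of inference time is spent on encodings, which are currently serial.
for workers in (2, 4, 8, 16):
    print(workers, round(estimated_speedup(0.88, workers), 2))
```

Under these assumptions, the speedup is bounded by 1 / 0.12 ≈ 8.3 however many workers are used, since atom placement and GPU prediction remain serial.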

Discussion
In this paper, we present H-Packer, a novel algorithm for predicting side-chain conformations by jointly regressing over the side-chain's χ angles. H-Packer is composed of two simple and fast rotationally equivariant neural networks: the first makes an initial guess using the coordinates of backbone atoms alongside residue identity information, while the second refines the predictions by considering the predicted coordinates of the neighboring side-chain atoms. We carefully study three alternative objective functions, eventually deciding on a geometrically justified loss function over the sine and cosine of χ angles. Our experiments show that H-Packer is competitive against physics-based methods and some machine-learning solutions, but its performance still lags behind the state-of-the-art at predicting χ angles closer to the backbone. Overall, the lack of consistent comparative patterns in performance metrics suggests that H-Packer learns features complementary to other approaches. In addition, the formulation of H-Packer makes it amenable to straightforward CPU parallelization to further speed up its already fast inference. We further emphasize that H-Packer is remarkably lightweight (2 × 3M parameters vs. 208M for AttnPacker) and requires few resources to train (a single GPU at < 1 hour per epoch vs. 4 GPUs for 400 epochs for DiffPack, with unknown total time). Limitations of the model include its inability to distinguish between covalent and non-covalent interactions, as atomic interactions are not explicitly encoded into the network, and its inherently lower angular resolution further away from a neighborhood's center. Future areas of improvement include enhancing angular resolution by scaling up ℓ_max while adjusting the architecture to reduce the resulting computational complexity, and training a confidence model for the predictions and using it to inform the refinement process.

A Appendix
A.1 More rigorous mathematical background on SO(3)-Equivariance
Group Invariance and Equivariance. Intuitively, a function is said to be invariant to a certain group of transformations (e.g., 3D rotations) if applying one such transformation to the function's input does not change its output. Equivariance is a generalization of invariance: when the input is transformed by the action of a group element (or rather by a matrix representation parameterized by the group element), the output of the function is transformed by the same group element (i.e., by a matrix representation parameterized by the same group element, which may differ from the input's representation). In short, equivariant functions transform the input in the same way regardless of its coordinate frame, but do not necessarily discard the coordinate frame information, whereas invariant functions also do the latter. Both of these concepts can be extended to properties as well, e.g., "the mass of a molecule remains constant (is invariant) when rotating it, whereas its dipole moment rotates alongside it (is equivariant)".
More formally, a function between two vector spaces f : X → Y is said to be equivariant to a group of transformations G iff applying any group transformation to the input of f corresponds to applying the same transformation to its output (i.e., via a representation parametrized by the same group element). Formally:
$$f\big(D_X(g)\,x\big) = D_Y(g)\,f(x), \qquad \forall g \in G,\ \forall x \in X.$$
The group acts on the input and output vector spaces with representations that are appropriate for each space (i.e., $D_X$ and $D_Y$). A group may have different representations, and a special one is the one that always maps to the identity: $D_Y(g) = 1$, ∀g ∈ G; a function on whose output space G acts with the identity representation is said to be invariant to G. In the context of machine learning, building models for which the output is provably invariant/equivariant to the same groups as the target function can avoid expensive data augmentation. However, even when fitting invariant functions, using equivariant layers is advisable, if not necessary [26].
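The definitions above can be checked numerically on toy functions. The following is a minimal sketch, not part of H-Packer: the point cloud, the helper `random_rotation`, and the two toy functions (`centroid` as an equivariant map, `radius_of_gyration` as an invariant one) are illustrative assumptions chosen only to make the two properties concrete.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:         # flip one axis to land in SO(3), not just O(3)
        q[:, 0] = -q[:, 0]
    return q

def centroid(points):
    """Equivariant toy function: the mean position rotates with the input."""
    return points.mean(axis=0)

def radius_of_gyration(points):
    """Invariant toy function: a scalar that ignores the coordinate frame."""
    centered = points - points.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 3))    # hypothetical point cloud, one row per atom
R = random_rotation(rng)
rotated = points @ R.T               # D_X(g): rotate every input point

# Equivariance: f(D_X(g) x) == D_Y(g) f(x), here with D_Y(g) = R.
equivariant_ok = np.allclose(centroid(rotated), R @ centroid(points))
# Invariance: same statement with D_Y(g) = identity.
invariant_ok = np.allclose(radius_of_gyration(rotated), radius_of_gyration(points))
print(equivariant_ok, invariant_ok)
```

In both checks the input representation is the ordinary 3×3 rotation matrix; only the output representation differs (the rotation matrix itself versus the identity).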
Irreducible representations. How are equivariant layers generally achieved? The key is to look at the group's irreducible representations (irreps). These are the group's smallest representations, so that any possible representation can be provably decomposed into a direct sum of irreps. Therefore, the group's irreps can be used to describe how the group elements act on any vector space. We can use this fact to build group-equivariant functions by ensuring that both the input and output of the function are composed (via direct sum, i.e., concatenation) of features that transform, under the group's action, according to the group's irreps.
SO(3)-Equivariance. The above is often easier said than done, but it has been worked out for SO(3), the group describing 3D rotations about a fixed point [27,23,20]. Spherical Fourier space can be used to conveniently define equivariant transformations for rotations. For rotations about a given reference point, points in 3D can be expressed in spherical coordinates (r, θ, ϕ) about the chosen origin. Since the radius r (i.e., the distance of a point from the reference) does not change under rotations about the origin, we will ignore the radial component for now and consider a signal over the sphere of radius r, $f(\theta, \phi) : S^2(r) \to \mathbb{R}$. The Fourier transform F of the signal on the sphere follows,
$$\hat{f}_{\ell m} = \int_0^{2\pi}\!\!\int_0^{\pi} f(\theta, \phi)\, Y^{*}_{\ell m}(\theta, \phi)\, \sin\theta \, d\theta \, d\phi,$$
where $Y_{\ell m}(\theta, \phi)$ is the spherical harmonic of degree ℓ and order m (and $*$ denotes complex conjugation), defined as
$$Y_{\ell m}(\theta, \phi) = \sqrt{\frac{2\ell + 1}{4\pi}\,\frac{(\ell - m)!}{(\ell + m)!}}\; e^{i m \phi}\, P^{m}_{\ell}(\cos\theta),$$
where ℓ is a non-negative integer (0 ≤ ℓ), and m is an integer within the interval −ℓ ≤ m ≤ ℓ. $P^{m}_{\ell}(\cos\theta)$ is the associated Legendre polynomial of degree ℓ and order m. The operators that describe how spherical harmonics transform under rotations are called the Wigner D-matrices, denoted by $D^{\ell}_{m m'}(R)$ [28].
Indeed, Wigner D-matrices are the irreps of SO(3). Therefore, any vector space that "3D-rotates" can be decomposed into a direct sum of type-ℓ features that transform according to the irrep of type ℓ. For example, features of type 0 are invariant to rotation (e.g., atomic mass), while features of type 1 transform as geometric vectors (e.g., dipole moment). Thus, to build an SO(3)-equivariant model we start by projecting the data onto a convenient SO(3)-equivariant basis via the spherical harmonics. We then leverage a suite of rules that allows one to build learnable layers without breaking equivariance, and transform the input into a new representation composed of features within the same range of possible types. One key transformation rule is the Clebsch-Gordan Tensor Product [28], which is commonly used to inject nonlinearity (to be precise, bi-linearity) in SO(3)-equivariant neural networks [23].
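To make the type-0 and type-1 statements concrete, the sketch below evaluates the degree-0 and degree-1 spherical harmonics on a few unit vectors and checks how they change when the vectors are rotated. It is an illustration under stated assumptions rather than the paper's code: it uses the real-valued spherical harmonics (a common convention in equivariant networks) in the order m = −1, 0, 1, for which the degree-1 harmonics are proportional to (y, z, x); the helper names and the matrix `PERM` are hypothetical.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rotation by `angle` (radians) about `axis`, via Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def real_sph_harm_l0(unit_vecs):
    """Degree-0 real spherical harmonic: a constant, hence rotation invariant."""
    return np.full((len(unit_vecs), 1), 0.5 / np.sqrt(np.pi))

def real_sph_harm_l1(unit_vecs):
    """Degree-1 real spherical harmonics, ordered m = -1, 0, 1 -> (y, z, x)."""
    x, y, z = unit_vecs.T
    return np.sqrt(3.0 / (4.0 * np.pi)) * np.stack([y, z, x], axis=1)

# Permutation taking (x, y, z) to (y, z, x), i.e. the assumed m-ordering.
PERM = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
dirs = rng.normal(size=(6, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # points on the unit sphere

R = rotation_matrix([1.0, 2.0, 3.0], 0.7)
D1 = PERM @ R @ PERM.T            # the degree-1 irrep in this real basis
rotated_dirs = dirs @ R.T

type0_invariant = np.allclose(real_sph_harm_l0(rotated_dirs), real_sph_harm_l0(dirs))
type1_equivariant = np.allclose(real_sph_harm_l1(rotated_dirs),
                                real_sph_harm_l1(dirs) @ D1.T)
print(type0_invariant, type1_equivariant)
```

In this basis the degree-1 irrep is just the 3×3 rotation matrix with its rows and columns reordered, which is the sense in which type-1 features "transform as geometric vectors".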

A.2 Equivariant architecture components
Linearity. The equivariant linear layer consists of a set of linear projections, each acting on the set of features sharing the same type ℓ. In practice it equates to $\ell_{\max} + 1$ standard linear layers with no bias except for the ℓ = 0 case, with the consideration that all 2ℓ + 1 moments of the same feature are processed together. Formally, let $h_\ell \in \mathbb{R}^{C \times (2\ell+1)}$ be a set of C features of type ℓ. Then, we learn a weight matrix $W_\ell \in \mathbb{R}^{C \times K}$ that linearly maps $h_\ell$ to K output features of the same type:
$$h'_\ell = W_\ell^{\top} h_\ell \in \mathbb{R}^{K \times (2\ell+1)}.$$
Tensor Product Nonlinearity. Arguably the most important SO(3)-equivariant operation is the Clebsch-Gordan (CG) Tensor Product. It is the only known operation capable of nonlinearly coupling (i.e., mixing information between) features of different type ℓ and with different momentum index m. The CG tensor product combines two features of degrees $\ell_1$ and $\ell_2$ to produce another feature of degree $|\ell_1 - \ell_2| \le \ell_3 \le \ell_1 + \ell_2$:
$$(h_{\ell_1} \otimes h_{\ell_2})_{\ell_3 m_3} = \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2} C^{(\ell_3 m_3)}_{(\ell_1 m_1)(\ell_2 m_2)}\, h_{\ell_1 m_1}\, h_{\ell_2 m_2},$$
where $C^{(\ell_3 m_3)}_{(\ell_1 m_1)(\ell_2 m_2)}$ are the Clebsch-Gordan coefficients. Similar to spherical harmonics, Clebsch-Gordan tensor products also appear in quantum mechanics, where they are used to express couplings between angular momenta.
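For two type-1 features, the CG tensor product has a familiar Cartesian form, which gives a quick way to see the coupling at work. The sketch below is an illustrative assumption, not the paper's implementation (which relies on e3nn): up to normalization and a change of basis, coupling two vectors yields a type-0 output (their dot product), a type-1 output (their cross product), and a type-2 output (the symmetric traceless part of their outer product). The function names are hypothetical.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rotation by `angle` (radians) about `axis`, via Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def couple_two_vectors(a, b):
    """Cartesian analogue of the CG product for l1 = l2 = 1 (3 x 3 = 1 + 3 + 5)."""
    scalar = a @ b                                   # type 0: 1 component
    vector = np.cross(a, b)                          # type 1: 3 components
    outer = np.outer(a, b)
    sym_traceless = 0.5 * (outer + outer.T) - (scalar / 3.0) * np.eye(3)  # type 2: 5 dof
    return scalar, vector, sym_traceless

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=3)
R = rotation_matrix([0.3, -1.0, 2.0], 1.1)

s, v, t = couple_two_vectors(a, b)
s_rot, v_rot, t_rot = couple_two_vectors(R @ a, R @ b)

type0_ok = np.isclose(s_rot, s)                      # invariant
type1_ok = np.allclose(v_rot, R @ v)                 # rotates as a vector
type2_ok = np.allclose(t_rot, R @ t @ R.T)           # rotates as a rank-2 tensor
print(type0_ok, type1_ok, type2_ok)
```

Each output is bilinear in the two inputs and transforms under its own irrep, which is why the operation can act as a nonlinearity without breaking equivariance.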
Here, we use the CG Tensor Product as the primary nonlinear activation of our networks, as originally prescribed by [23], and crucially the only operation that can transfer information across features of different types ℓ. Following [25] and [9], we compute the Tensor Product feature-wise (i.e., we do not compute it across features with a different index) to significantly decrease computation time. We refer the reader to [25] and [9] for details.
Layer Norm Nonlinearity. This consists of applying a standard Layer Norm [29] to the norms of type-ℓ features (which are invariant), and then feeding the normalized norm into a standard nonlinear activation function. This layer effectively combines the equivariant layer norm in e3nn [27] with the Norm Nonlinearity originally used in Harmonic Networks [30], and then adapted to the SO(3) domain by Tensor Field Networks [20]. It is also used in the SE(3)-Transformer [16,31].
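The linear layer and the norm-based nonlinearity of this subsection can be sketched for a single feature type ℓ as follows. This is a rough illustration under several assumptions that the text does not confirm: the activated, normalized norm is used to rescale each feature (one plausible reading of "feeding the normalized norm into a nonlinear activation"), the layer norm has no learnable affine parameters, SiLU is the activation, and a random orthogonal matrix stands in for a real-basis Wigner D-matrix. All function names are hypothetical.

```python
import numpy as np

def equivariant_linear(h, W):
    """Map C features of type l to K features: mixes channels, never the m index.

    h has shape (C, 2l+1) and W has shape (C, K); the result has shape (K, 2l+1).
    """
    return W.T @ h

def silu(x):
    """SiLU activation, x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def norm_nonlinearity(h, eps=1e-5):
    """Assumed form: layer-normalize the per-feature norms, activate, rescale h."""
    norms = np.linalg.norm(h, axis=1)                       # (C,), rotation invariant
    normed = (norms - norms.mean()) / np.sqrt(norms.var() + eps)
    gate = silu(normed)                                     # invariant gate per feature
    return h * gate[:, None]

rng = np.random.default_rng(0)
ell, C, K = 2, 8, 4
h = rng.normal(size=(C, 2 * ell + 1))
W = rng.normal(size=(C, K))

# Orthogonal stand-in for the (real-basis) Wigner D-matrix acting on the m index.
D, _ = np.linalg.qr(rng.normal(size=(2 * ell + 1, 2 * ell + 1)))

def block(features):
    return norm_nonlinearity(equivariant_linear(features, W))

# Equivariance: transforming the input and then applying the block
# matches applying the block and then transforming the output.
block_equivariant = np.allclose(block(h @ D.T), block(h) @ D.T)
print(block_equivariant)
```

The check passes because the weights act only on the channel index while the representation acts only on the m index, and because the gate depends on the features only through their norms.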
The SO(3)-equivariant architecture components were implemented with the help of e3nn primitives [27]. We emphasize that we do not ablate all the architectural components, and different choices would be possible. Specifically, we do not ablate over the choice of normalization and of invariant nonlinearity function: in preliminary experiments, other options seemed to perform equally. Other components are necessary or were otherwise found to be useful. For example, the tensor product is necessary to ensure that information flows across different ℓs, and tuning dropout was found to be greatly useful to prevent overfitting.

A.3 Proof of equivalence between MSE and cosine loss in predicting Sine and Cosine of χ angles
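Let $\chi_i$ and $\hat{\chi}_i$ be the true and the predicted angle values for a residue's i-th χ angle, respectively. Let $v_i \equiv [\sin\chi_i, \cos\chi_i]$ and $\hat{v}_i \equiv [\sin\hat{\chi}_i, \cos\hat{\chi}_i]$ denote their respective 2D vectors on the unit circle, whose two components are the sine and cosine of the angles themselves. Then we have:
$$\begin{aligned}
\lVert v_i - \hat{v}_i \rVert^2 &= (\sin\chi_i - \sin\hat{\chi}_i)^2 + (\cos\chi_i - \cos\hat{\chi}_i)^2 \\
&= (\sin^2\chi_i + \cos^2\chi_i) + (\sin^2\hat{\chi}_i + \cos^2\hat{\chi}_i) - 2\,(\sin\chi_i \sin\hat{\chi}_i + \cos\chi_i \cos\hat{\chi}_i) && (1) \\
&= 2 - 2\, v_i \cdot \hat{v}_i && (2) \\
&= 2\,\big(1 - \cos(v_i, \hat{v}_i)\big),
\end{aligned}$$
where the jump from step 1 to step 2 is granted by the trigonometric identity $\sin^2\theta + \cos^2\theta = 1$, and the last step uses the fact that both vectors have unit norm, so that their dot product equals their cosine similarity. The squared error is therefore exactly twice the cosine loss $1 - \cos(v_i, \hat{v}_i)$, and the two objectives are equivalent up to a constant factor.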
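The identity can also be checked numerically. The sketch below is an illustrative verification on random angles rather than the training code; the function names are hypothetical, and the squared error is taken as the sum over the two components (a mean over components would only change the constant factor).

```python
import numpy as np

def angle_to_vec(chi):
    """Encode angles as [sin(chi), cos(chi)] points on the unit circle."""
    return np.stack([np.sin(chi), np.cos(chi)], axis=-1)

def squared_error(v, v_hat):
    """Squared Euclidean distance between true and predicted vectors."""
    return ((v - v_hat) ** 2).sum(axis=-1)

def cosine_loss(v, v_hat):
    """One minus the cosine similarity between true and predicted vectors."""
    dot = (v * v_hat).sum(axis=-1)
    norms = np.linalg.norm(v, axis=-1) * np.linalg.norm(v_hat, axis=-1)
    return 1.0 - dot / norms

rng = np.random.default_rng(0)
chi_true = rng.uniform(-np.pi, np.pi, size=1000)
chi_pred = rng.uniform(-np.pi, np.pi, size=1000)
v, v_hat = angle_to_vec(chi_true), angle_to_vec(chi_pred)

se = squared_error(v, v_hat)
cl = cosine_loss(v, v_hat)

matches_cosine_loss = np.allclose(se, 2.0 * cl)
matches_angle_form = np.allclose(se, 2.0 * (1.0 - np.cos(chi_true - chi_pred)))

# Angles are recovered from the (sin, cos) encoding with a two-argument arctangent.
chi_decoded = np.arctan2(v_hat[:, 0], v_hat[:, 1])
decoding_ok = np.allclose(chi_decoded, chi_pred)
print(matches_cosine_loss, matches_angle_form, decoding_ok)
```

The equality relies on the predicted vector having unit norm; if a network outputs an unnormalized 2D vector, the squared error additionally penalizes its length, while the cosine loss does not.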

A.4.1 Toy Task
All models were constructed with 5 equivariant blocks and no invariant feed-forward neural network (FFNN) for the two models predicting invariant quantities. All models were trained using Adam [32]

Models with $\ell_{\max} = 4$ have five equivariant blocks with a per-ℓ hidden feature size of 128. For models with $\ell_{\max} = 5$ we use five blocks with a per-ℓ size of 96. In doing so, the two models have a comparable number of parameters (3M). All models have a 3-layer FFNN with SiLU nonlinearity and a dropout rate of 0.1. We found it useful to tune the dropout rate to prevent overfitting. We train all models for 10 epochs, keeping the model with the lowest validation loss at the end of an epoch (convergence usually happened by epoch ∼8); models with $\ell_{\max} = 5$ took roughly 50 minutes per epoch to train on a single NVIDIA A40 GPU, while models with $\ell_{\max} = 4$ took 35 minutes.

Figure 1: Overview of H-Packer. A: Illustration of Glutamine's χ angles, of which there are three. B: Schematic of the H-CNN style network for side-chain packing, which first predicts the missing residue's χ angles from its surrounding atomic environment, and then uses the χ angles to reconstruct the residue's side-chain. C: H-Packer consists of two H-CNN networks, one trained on backbone atoms only and used to make an initial guess, and one trained on full side-chain neighborhoods and used to refine the predictions.

Figure 2: Test MAE for the simple task of predicting χ angles from atomic conformation. Panels show reconstruction accuracies using three loss functions: the angle χ itself (left), the sin/cos transform of the angle (center), and the normal vectors to the dihedral planes (right), for different maximum angular degrees $\ell_{\max}$ (colors).

Figure A.2: Difference in Null Reconstruction Error on Test data by using a smaller set of structures for reference.

Figure A.3: χ angles error trace during training for the simple task of predicting χ angles from true amino-acid conformations. The Sin-Cos and Plane Norms models show comparable convergence curves, except for $\ell_{\max} = 1$, where the Plane Norms model struggles with $\chi_3$ and $\chi_4$. The Angles model has the worst convergence trace, even overfitting for high $\ell_{\max}$.

Table 1: Comparative assessment on CASP13 and CASP14. We present best results in bold and second-best underlined. Italicized results represent an upper bound to our algorithm's performance. Performance of models other than H-Packer is taken from [7].

Table 2: … across all residues with different treatments of the true structure. Rec: reconstructing the true structure with our data-derived redundant internal coordinates. Sym: considering the additional non-natural symmetries used by AttnPacker.

Comparative Evaluation on CASP13 and CASP14 targets. Table 1 compares H-Packer's performance in side-chain packing with other computational methods [6,7,5,3,2,4]. Despite its simplicity, H-Packer 5 is competitive against the state-of-the-art at predicting $\chi_3$ and $\chi_4$, but falls behind on $\chi_1$ and $\chi_2$ predictions. This discrepancy indicates that H-Packer has likely learned complementary features to the other models. Moreover, H-Packer mostly outperforms the physics-based computational algorithms [3,2,4] and is competitive with DLPacker [5] in all our performance metrics. Interestingly, H-Packer is consistently better than physics-based approaches in terms of overall Atom RMSD, but tends to fall short on Angle Accuracy. We present error distributions for H-Packer in Figures A.5, A.6, A.7, and A.8.

Table 3: Ablation in $\ell_{\max}$. Metrics for the other H-Packer models can be found in Table A.2.

Table 4: Relative times to undertake full atomic reconstruction. In our current (unoptimized) implementation,

for 30 epochs with learning rate 0.001 and a batch size of 32. The model exhibiting lowest validation loss was used.

Null Reconstruction Error on Test data by using data-derived values as redundant internal coordinates. Effectively, each redundant internal coordinate (i.e., excluding χ angles) is substituted with a single value computed as the median of the corresponding value in a reference dataset. Across all structures, the average Null Reconstruction RMSD is 0.127 Å. We notice a small number of outliers (single residues with abnormally high RMSD), but we do not investigate the causes.

Table A.1: List of conformational symmetries that we take into consideration.